Perspectives on Linguistics
2026-03-02
Drawing on data from our Schnëssen project, we seek to understand the driving factors behind the choice of a variant of Germanic origin (Luxembourgish or German) or French origin.
Example:
This linguistic choice as the dependent variable is influenced by social or other linguistic variables as independent variables:
Corollary aim of this presentation:
Collected with Schnëssen app
Sub-set for this study extracted as data frame/tibble in R.
The dataset has 38452 rows.
Give an overview as table:
The tibble input_data has the following structure.
Classes 'tbl_df', 'tbl' and 'data.frame': 38452 obs. of 16 variables:
$ id : Factor w/ 3151 levels "1000","1004",..: 393 452 508 555 726 783 811 821 828 849 ...
$ variant : Factor w/ 116 levels "Zwiwwel","Ënn",..: 1 1 2 2 1 2 2 2 2 2 ...
$ variable : Factor w/ 50 levels "Variant_Ënn",..: 1 1 1 1 1 1 1 1 1 1 ...
$ French_origin : chr "non-French" "non-French" "French" "French" ...
$ domain : chr "lexicon" "lexicon" "lexicon" "lexicon" ...
$ urbanity : Factor w/ 3 levels "smaller towns",..: 1 1 2 1 1 1 1 2 2 1 ...
$ socio_index : Factor w/ 15 levels "(0.122,0.278]",..: 5 1 5 5 4 2 2 2 5 1 ...
$ socio_index_raw : num 0.89 0.243 0.821 0.856 0.594 ...
$ region : Factor w/ 5 levels "Norden","Osten",..: 3 4 2 4 3 4 3 4 2 4 ...
$ first_language : Factor w/ 1 level "Jo": 1 1 1 1 1 1 1 1 1 1 ...
$ age6 : Ord.factor w/ 6 levels "≤ 24"<"25 bis 34"<..: 2 2 2 2 2 2 3 2 5 2 ...
$ age : Factor w/ 3 levels "young","middle-aged",..: 1 1 1 1 1 1 2 1 3 1 ...
$ gender : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 2 2 2 ...
$ education : Ord.factor w/ 4 levels "other"<"Technical\nschool"<..: 4 4 4 4 4 4 4 4 3 4 ...
$ competence_french: Factor w/ 3 levels "French\nlow",..: 3 2 2 2 2 3 3 2 3 3 ...
$ competence_german: Factor w/ 3 levels "German low","German average",..: 3 3 3 3 3 3 3 3 3 3 ...
id variant variable
1826 : 48 Kannapee : 1188 Variant_Déierendoktesch: 1704
6398 : 48 Toilettëpabeier: 1126 Variant_Kannapee : 1626
6410 : 48 Ënn : 1120 Variant_Ënn : 1589
1700 : 47 Kamäin : 1087 Variant_Schaarschtech : 1378
3818 : 47 Déierendokter : 1045 Variant_Schallimo : 1278
203 : 46 Kürbis : 1034 Variant_Kürbis : 1224
(Other):38168 (Other) :31852 (Other) :29653
French_origin domain urbanity
Length:38452 Length:38452 smaller towns :17007
Class :character Class :character rural areas :16651
Mode :character Mode :character 1\nStad Lëtzebuerg: 4794
socio_index socio_index_raw region first_language
(0.278,0.434]:11792 Min. :0.1230 Norden : 4060 Jo:38452
(0.589,0.744]: 7396 1st Qu.:0.3434 Osten : 6874
(0.434,0.589]: 7380 Median :0.4746 Süden :10643
(0.122,0.278]: 5969 Mean :0.4998 Zentrum:16875
(0.744,0.9] : 5619 3rd Qu.:0.6373 #ERROR!: 0
(0.313,0.46] : 52 Max. :0.8997
(Other) : 244
age6 age gender
≤ 24 : 7451 young :17977 male :11634
25 bis 34:10526 middle-aged:13313 female:26818
35 bis 44: 6799 old : 7162
45 bis 54: 6514
55 bis 64: 5106
65+ : 2056
education competence_french
other : 0 French\nlow : 2675
Technical\nschool :13431 French\naverage:18193
Classical\nsecondary\nschool: 6881 French\nhigh :17584
University :18140
competence_german
German low : 119
German average: 4377
German high :33956
Using crosstables - Responses by age
French non-French
young 6768 11209
middle-aged 5229 8084
old 3037 4125
French non-French
French\nlow 879 1796
French\naverage 6776 11417
French\nhigh 7379 10205
French non-French
French\nlow 0.02285967 0.04670758
French\naverage 0.17621970 0.29691564
French\nhigh 0.19190159 0.26539582
ids (= speakers) are in the dataset?Hypothesis: The choice of a French variant (dependent variable/response variable) is influenced by social factors (independent variables/predictors).
Choice of regression analysis dependent on the nature of the dependent variable:
Our dependent variable French_origin is binary, thus we will use logistic regression.
Prepare data
# see: https://slcladal.github.io/regression.html#Random_Effects
library(tidyverse)
library(lme4)
library(sjPlot)
data <- input_data %>%
# filter for domain (lexicon or phonology)
filter(domain == "lexicon") %>%
# convert binary variable to values 0 and 1
mutate(across(French_origin, str_replace, "non-French", "0")) %>%
mutate(across(French_origin, str_replace, "French", "1")) %>%
mutate(French_origin = as.integer(French_origin)) %>%
mutate(age = factor(age, ordered = FALSE)) %>%
mutate(age6 = factor(age6, ordered = TRUE)) %>%
mutate(education = factor(education, ordered = FALSE)) %>%
mutate(competence_german = factor(competence_german, ordered = FALSE)) %>%
mutate(competence_french = factor(competence_french, ordered = FALSE)) %>%
mutate(urbanity = factor(urbanity, ordered = FALSE))The regression model will be fit step by step (‘stepwise regression’) by starting out with a base model without predictors and then adding one predictor after the other. After each step the model will be compared statistically with the previous one. If better, the predictor is retained, if the model is worse or the same, the predictor is eliminated. Step by step then the explanatory predictors are detected.
Fit base model. Use id and variable as random effects.
Add age as first predictor.
m1.lmer <- lmer(formula=French_origin ~ age + (1|id) + (1|variable), REML = T, data = data)
tab_model(m1.lmer)| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.38 | 0.30 – 0.45 | <0.001 |
| age [middle-aged] | 0.02 | 0.01 – 0.03 | 0.001 |
| age [old] | 0.04 | 0.03 – 0.06 | <0.001 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.001 / 0.339 | ||
Compare which model is performing better.
Data: data
Models:
m0.lmer: French_origin ~ 1 + (1 | id) + (1 | variable)
m1.lmer: French_origin ~ age + (1 | id) + (1 | variable)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m0.lmer 4 38443 38477 -19218 38435
m1.lmer 6 38411 38462 -19199 38399 36.505 2 1.183e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Age is a significant contribution to the model. ✅ The age model is significantly better than the previous one.
Adding further predictors, starting with gender. Instead of refitting the model, we can update the model.
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.38 | 0.31 – 0.46 | <0.001 |
| age [middle-aged] | 0.02 | 0.01 – 0.03 | 0.001 |
| age [old] | 0.04 | 0.03 – 0.06 | <0.001 |
| gender [female] | -0.01 | -0.02 – 0.01 | 0.290 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.001 / 0.339 | ||
Compare again, which one is better.
Data: data
Models:
m1.lmer: French_origin ~ age + (1 | id) + (1 | variable)
m2.lmer: French_origin ~ age + (1 | id) + (1 | variable) + gender
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m1.lmer 6 38411 38462 -19199 38399
m2.lmer 7 38411 38471 -19199 38397 1.1189 1 0.2902
Gender is not significant. Remove it? Let’s try for the interaction with age.
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.38 | 0.31 – 0.46 | <0.001 |
| age [middle-aged] | 0.02 | -0.00 – 0.04 | 0.062 |
| age [old] | 0.04 | 0.01 – 0.06 | 0.003 |
| gender [female] | -0.01 | -0.02 – 0.01 | 0.388 |
| age [middle-aged] × gender [female] |
-0.00 | -0.03 – 0.02 | 0.926 |
| age [old] × gender [female] |
0.01 | -0.02 – 0.04 | 0.620 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.001 / 0.339 | ||
Data: data
Models:
m1.lmer: French_origin ~ age + (1 | id) + (1 | variable)
m2.lmer: French_origin ~ age + (1 | id) + (1 | variable) + gender + age:gender
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m1.lmer 6 38411 38462 -19199 38399
m2.lmer 9 38415 38492 -19199 38397 1.4425 3 0.6956
None of interactions is significant and the model is not performing better than the previous one. Gender will be removed ❌. Interestingly, and contrary to general sociolinguistic assumptions, gender seems to play no role in the choice of a French variant.
Add predictor competence_french.
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.32 | 0.25 – 0.40 | <0.001 |
| age [middle-aged] | 0.01 | -0.00 – 0.02 | 0.080 |
| age [old] | 0.03 | 0.02 – 0.05 | <0.001 |
| competence french [French average] |
0.04 | 0.02 – 0.07 | <0.001 |
| competence french [French high] |
0.09 | 0.06 – 0.11 | <0.001 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.004 / 0.339 | ||
Data: data
Models:
m1.lmer: French_origin ~ age + (1 | id) + (1 | variable)
m2.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m1.lmer 6 38411 38462 -19199 38399
m2.lmer 8 38324 38393 -19154 38308 90.154 2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, this is one is performing well. ✅
Maybe competence_french is interacting with age?
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.32 | 0.24 – 0.40 | <0.001 |
| age [middle-aged] | -0.00 | -0.05 – 0.05 | 0.987 |
| age [old] | 0.07 | 0.00 – 0.14 | 0.047 |
| competence french [French average] |
0.05 | 0.02 – 0.07 | <0.001 |
| competence french [French high] |
0.08 | 0.06 – 0.11 | <0.001 |
| age [middle-aged] × competence french [French average] |
0.01 | -0.04 – 0.06 | 0.771 |
| age [old] × competence french [French average] |
-0.04 | -0.12 – 0.03 | 0.222 |
| age [middle-aged] × competence french [French high] |
0.02 | -0.03 – 0.07 | 0.518 |
| age [old] × competence french [French high] |
-0.03 | -0.10 – 0.04 | 0.427 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.004 / 0.339 | ||
Data: data
Models:
m2.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french
m3.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + age:competence_french
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m2.lmer 8 38324 38393 -19154 38308
m3.lmer 12 38330 38432 -19153 38306 2.7069 4 0.608
No interaction! Removed from the model. ❌
Add predictor competence_german.
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.34 | 0.22 – 0.46 | <0.001 |
| age [middle-aged] | 0.01 | -0.00 – 0.02 | 0.086 |
| age [old] | 0.03 | 0.02 – 0.05 | <0.001 |
| competence french [French average] |
0.05 | 0.02 – 0.07 | <0.001 |
| competence french [French high] |
0.09 | 0.07 – 0.11 | <0.001 |
| competence german [German average] |
-0.02 | -0.11 – 0.08 | 0.740 |
| competence german [German high] |
-0.02 | -0.11 – 0.07 | 0.694 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.004 / 0.339 | ||
Data: data
Models:
m2.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french
m3.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + competence_german
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m2.lmer 8 38324 38393 -19154 38308
m3.lmer 10 38328 38414 -19154 38308 0.2494 2 0.8828
Not significant! Removed from the model. ❌
Add predictor education.
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.31 | 0.24 – 0.39 | <0.001 |
| age [middle-aged] | 0.02 | 0.00 – 0.03 | 0.013 |
| age [old] | 0.04 | 0.02 – 0.05 | <0.001 |
| competence french [French average] |
0.03 | 0.01 – 0.06 | 0.002 |
| competence french [French high] |
0.07 | 0.05 – 0.09 | <0.001 |
| education [Classical secondary school] |
0.04 | 0.02 – 0.05 | <0.001 |
| education [University] | 0.03 | 0.02 – 0.04 | <0.001 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.34 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.005 / 0.339 | ||
Data: data
Models:
m2.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french
m3.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + education
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m2.lmer 8 38324 38393 -19154 38308
m3.lmer 10 38297 38382 -19138 38277 31.709 2 1.302e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Education is a significant contribution to the model. ✅ Speakers coming from a classical school or university, use significantly more French variants.
Now adding predictors relating to the location of the speaker, starting with the degree of urbanity.
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.33 | 0.25 – 0.41 | <0.001 |
| age [middle-aged] | 0.01 | -0.00 – 0.02 | 0.095 |
| age [old] | 0.02 | 0.01 – 0.04 | 0.002 |
| competence french [French average] |
0.03 | 0.01 – 0.05 | 0.004 |
| competence french [French high] |
0.06 | 0.04 – 0.08 | <0.001 |
| education [Classical secondary school] |
0.03 | 0.02 – 0.05 | <0.001 |
| education [University] | 0.03 | 0.02 – 0.04 | <0.001 |
| urbanity [rural areas] | -0.03 | -0.04 – -0.02 | <0.001 |
| urbanity [1
Stad Lëtzebuerg] |
0.04 | 0.02 – 0.06 | <0.001 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.33 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.007 / 0.339 | ||
Data: data
Models:
m3.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + education
m4.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + education + urbanity
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m3.lmer 10 38297 38382 -19138 38277
m4.lmer 12 38240 38343 -19108 38216 60.343 2 7.881e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Urbanity is a ignificant contribution to the model! ✅ Check the algebraic sign for the estimates: In rural areas significant less French variants are used. In the capital significant more French variants are used.
Add predictor socio-economic index; ranges from 0 to 1 and is based on the share of single parents, mean salary, share of persons with RMG, level of unemployment per commune. 0 = favorable commune, 1 = defavorable commune (see STATEC).
| French_origin | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.38 | 0.30 – 0.46 | <0.001 |
| age [middle-aged] | 0.01 | -0.00 – 0.02 | 0.111 |
| age [old] | 0.03 | 0.01 – 0.04 | <0.001 |
| competence french [French average] |
0.03 | 0.01 – 0.05 | 0.003 |
| competence french [French high] |
0.06 | 0.04 – 0.08 | <0.001 |
| education [Classical secondary school] |
0.03 | 0.02 – 0.05 | <0.001 |
| education [University] | 0.03 | 0.01 – 0.04 | <0.001 |
| urbanity [rural areas] | -0.04 | -0.05 – -0.03 | <0.001 |
| urbanity [1
Stad Lëtzebuerg] |
0.02 | 0.01 – 0.04 | 0.007 |
| socio index raw | -0.08 | -0.11 – -0.06 | <0.001 |
| Random Effects | |||
| σ2 | 0.15 | ||
| τ00 id | 0.01 | ||
| τ00 variable | 0.07 | ||
| ICC | 0.33 | ||
| N id | 3151 | ||
| N variable | 50 | ||
| Observations | 38452 | ||
| Marginal R2 / Conditional R2 | 0.008 / 0.339 | ||
Data: data
Models:
m4.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + education + urbanity
m5.lmer: French_origin ~ age + (1 | id) + (1 | variable) + competence_french + education + urbanity + socio_index_raw
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
m4.lmer 12 38240 38343 -19108 38216
m5.lmer 13 38206 38317 -19090 38180 36.432 1 1.581e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The socio-economic index is a significant contribution to the model. ✅ The higher the index, i.e. the less favorable the commune, the less French variants are used.
This is our final model. Speakers favoring French variants show the following social characteristics:
rather old
average to high competence in French
education in classical school or university
rather living in the capital, then in rural areas
low socio-economic index
See Plevoets (2020); Gilles (2023)
Run a regression with two predictors for all variants, instead for the variables. Group similar variants together in a two-dimensional space.
Run the correspondence regression analysis.
A Screeplot shows the amount of variation explained per so-called ‘latent variable’.
In this case, 65 % of variation is explained by the first ‘latent variable’ and 13 % by the second latent variable. A considerable amount of variation is thus explained through these two predictors, competence_french and age.
Now plot this in a two-dimensional space. Use green for variants with French origin and blue for variants with Germanic origin. In addition, plot also the values for age (young - middle-aged - old) and competence_french (low - average - high).
# Colors for plotting:
corr.col <- ifelse(xtabs(~variant + French_origin, data=input_data)[,"French"] > 0,
"green3", "blue")
plot(corr.crg, x_ell=TRUE, xsub=c("competence_french", "age"), col_btm=corr.col, col_top="orange",
cex_btm=0.7, cex_top=0.7, font_btm= 1, font_top=2, hlim=c(-0.7,1.2), vlim=c(-0.3,0.6),
hlab = "Latent variable 1: 'xxx'", vlab = "Latent variable 2: 'xxx'", add_ori=TRUE)
French variants tend to concentrate in the NE quadrant, and Germanic variants more in the lower half. Interestingly, green and blue form - cum grano salis - two clouds. Distances become visible, e.g. Buttek associated with ‘French high’ and Geschäft with ‘French low’.
R will help you to understand your data and find the best statistical analysis
Quarto will help you to write an analysis, a report or even an entire fully-fledged article in a journal layout - all in one place.